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Abstract 

Tiling arrays make possible a large scale exploration of the genome 
thanks to probes which cover the whole genome with very high density un- 
til 2 000 000 probes. Biological questions usually addressed are either the 
expression difference between two conditions or the detection of transcribed 
regions. In this work we propose to consider simultaneously both questions 
as an unsupervised classification problem by modeling the joint distribution 
of the two conditions. In contrast to previous methods, we account for all 
available information on the probes as well as biological knowledge like 
annotation and spatial dependence between probes. Since probes are not bi- 
ologically relevant units we propose a classification rule for non-connected 
regions covered by several probes. Applications to transcriptomic and ChlP- 
chip data of Arabidopsis thaliana obtained with a NimbleGen tiling array 
highlight the importance of a precise modeling and the region classification. 



Keywords: Bivariate Gaussian mixture; Hidden Markov model; Tiling arrays; Unsu- 
pervised classification. 
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1 Introduction 



For 15 years, the study of large-scale genome is possible thanks to DNA microar- 
rays. The probes, originally designed on genes, now cover the whole genome 
without a priori knowledge of structural annotation: these are tiling arrays. The 
density is still increasing and companies now offer tiling arrays with 2 millions of 
probes. Thanks to technological advances and to the miniaturization of the sup- 
port, tiling arrays have become a usual tool in biology laboratories. They make 
possible a large-scale exploration of the genome with a reasonable cost. Recently, 
the Next Generation Sequencing (NGS) technology revolutionize the domain be- 
cause it directly products nucleotide sequences. However, like any new technol- 
ogy, it remains expensive and suffers for now from uncontrolled technical biases 
(Oshlack, Robinson and Young! 2010| ). It also raises new questions on read map- 
ping or genome assembly. For all these reasons, tiling arrays, with a technology 
well controlled, remain widely used. They are a powerful tool to analyse all kinds 
of experiments and are used in a wide range of studies like DNA methylation, 
chromatin modification or transcription factor analysis with ChlP-chip experi- 
ments (Buc k and Lieb[ [2004 ), DNA copy number variation detection with CGH 
( |Pinkel efaf. 1998 Snijders et al.\ |2001[ ) and survey of genomic transcriptional 
activities or transcript mapping with transcriptional experiments ( |Mockler et al.\ 

20U71 ). 



2005 Yamada^a/. 2003 , Hanada et al. 



For comparative genomic hybridization, many different approaches exist for 
determining DNA copy number variations in CGH data like segmentation (Hupe 



[gFalj [20041 |Picard gfa/.[ [2005] ) or Hidden Markov Models (Fridlyan d gf a/.[|2"004 



Seifertgf a/4|2009l ). 



For ChlP-chip experiments where the chromatin immunoprecipitation (ChIP) 
sample and the reference sample of genomic DNA are compared, the main goal is 
to detect regions enriched by ChIP John son gf q/.| ( |2"0 06) proposed a Model-based 
Analysis of Tiling arrays (MAT) algorithm dedicated to Affymetrix arrays. MAT 
models the baseline probe behavior based on probe sequence characteristics and 
genome copy number. Li et al. ( 2005| ) proposed to model the behavior of each 
probe and then a 2-state HMM is used to estimate the enrichment probability at 
each probe location. In these two methods it is assumed that only a small propor- 
tion of probes is enriched by ChIP This assumption is reasonable for ChlP-chip 
experiments dealing with transcription factor but not for histone modification or 
DNA methylation where a large enrichment is expected. Humburg et a/.|([2008) 



has suggested a parameter estimation procedure for robust HMM analysis of chro- 
matin structure where several long regions of interest are expected. ChlP-chip data 
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can also be seen as one signal along the genome when using the log-ratio between 
the intensities of the ChlP and the reference samples. Analyses are then usually 
done using a sliding window ( Cawley et al.\ 2004| ) and statistic tests. Keles et al. 



( |2004[ ) and |He et al.\ (|2~009) have proposed respectively the Welch t-statistic and 
the non parametric Wilcoxon rank-sum method. The log-ratio is a relevant quan- 
tity only when its distribution is bimodal which is an ideal case not often obtained 
with real datasets. To overcome this problem Martin-Magniette et al. (2008) pro- 
posed modeling the distribution of the signal of the ChlP sample conditional to 
the reference signal by a bidimensionnal mixture of regression. |Marti n-Magniette 



et al. ( 2008[ ) also proposed controlling a false positive risk. To perform com- 



parative ChlP-chip study, two ChlP samples can also be directly compared with 



a bidimensionnal mixture model to study the differential enrichment ( |Johannes 



2010). The two samples then play symmetric roles. 



Transcriptomic experiments may have two different purposes: the detection 
of transcribed regions or the study of gene expression across several conditions 
(also called differential analysis). Most methods previously developed for tiling 
array transcriptomic data deal with the first purpose. Among them, some methods 



are based on probe-by-probe statistical tests (e.g. Fisher test developed by Halasz 



et al. (2006)) and others are based on segmentation methods such as |Huber et al. 



( [2006] ) or |Zeller ef a^ ( [2008] ) or HMM (Nic olas et aL\ [20091 ) • Tne incorporation 
of the annotation knowledge has also been proposed in a supervised framework 
(Du et al., 2006, Munch et a/.||2006 ). Surprisingly few methods are devoted to 
the study of gene expression profiles across samples based on tiling arrays. Some 
methods aggregate probes within regions and then resume to hypothesis testing. 



The method gSAM (Ghosh et al. 2007) is an extension of SAM, which models 



the differential expression of a given region by a constant piece-wise function. In 
the TileMap method ( |Ji and Wong[ [2005 ) each probe is used separately and a test 
statistic is proposed, based on a hierarchical empirical Bayes model. 



In this article, we consider simultaneously the expression difference between 
two conditions and the detection of transcribed regions with an unsupervised clas- 
sification point of view. We study the difference between two ChlP samples or 
between two transcriptomic samples. Comparing the two samples requires distin- 
guishing four different biologically interpretable groups of probes: a group with 
similar behavior in both samples, a group with higher intensity in the first sample 
than in the second sample, a symmetric group with higher intensity in the second 
sample and a last group with low intensity in both samples which can be viewed 
as noise, corresponding to the non-transcribed regions (cf Figure [T]). 
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Figure 1: Schematic explanation of the 4 groups to consider when comparing two 
samples. Example of transcriptomic data. 

A parametric classification method based on multivariate mixture models per- 
mits a direct comparison of two (or more) samples. Moreover this differs from a 
log-ratio based study which does not distinguish the group of identical behavior 
from the noise group. Therefore our method directly models the intensities of the 
two samples for all probes. In contrast to previous methods, we consider all the 
available information: the intensity of the two signals, the position of the probe 
along the genome and its structural annotation. The position of the probe is impor- 
tant because there is a signal dependence between adjacent probes due to the high 
resolution of tiling arrays. Structural annotation informs us about the location of 
the probes in intergenic, exonic or intronic regions (see Figure[2} screen capture of 



FLAGdb++ (Sa mson et a/.[|2004| )). This must be accounted for as, in a transcrip- 



tomic experiment, probes annotated as exonic are more likely to be hybridized 
whereas intergenic or intronic (non-coding) probes should be mainly in the noise 
group. We use a 4-state heterogenous hidden Markov model with bidimensional 
Gaussian emission densities to gather all this information. Finally since genome 
annotation is an on-going process with possible errors, we will discuss the rele- 
vance of its use for each specific application. 
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Figure 2: Example of genome annotation. Grey squares: probes; black arrows: 
known genes. The arrows correspond to exons and the fine lines between arrows 
correspond to introns. 



Most methods provide probe-by-probe results. As for the classification pur- 
pose, the HMM provides an answer for each probe, via the posterior probabilities. 
However probes are not relevant units from a biological point of view. Although 
HMM are widely used for classification problems in genomic data, the classifica- 
tion of biologically interesting regions is not a common practice. To get a result 
by region, the most common used method is a sliding window approach where 



the probe signals are merged a priori. Another method proposed by Li et al 



(2005 ) is to defined a region as at least two probes with positive log-odds enrich- 
ment value in ChIP sample and at least one probe with log-odds enrichment value 
lower than — 15 in the control sample. But these methods do not deal with regions 
covered by several non-adjacent probes, such as genes with exons and introns. 
We propose a new solution deriving a posterior probability for a region given a 
priori with arbitrary structure (like a non-connected region) and also a procedure 
of gene classification which allows to get quickly a list of differentially expressed 
genes. This calculation by region really improves results than deducing a result 
from probe classifications a posteriori. 

The article is organized as follows. The statistical model is described in Sec- 
tion 



2.1| The inference is given in Section |2.2| Section |2.3| describes the clas- 
sification method for a probe and a gene. In Section [3} we discuss the different 
sub-models of the method and apply them on NimbleGen tiling arrays for tran- 
scriptomic and ChlP-chip data of the plant Arabidopsis thaliana. The main con- 
clusions and some possible extensions are discussed in Section |4} The method is 
implemented in R and C and is available upon request. 



2 Methods 

We propose a non- supervised classification model to compare the intensities of the 
two samples hybridized on the array for each probe. It accounts for all available 
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information for each probe: the two intensities to be compared, the position of 
the probe along the chromosome and the current annotation of the probe (for ex- 
ample exonic, intronic, intergenic, transposable element, etc). As recommended 
in |Johannes (2010), our method does not deal with the usual log-ratio but rather 



considers the two intensities separately and the joint signal of the two samples is 
modeled. 



2.1 Model 

For probe t, we denote 

• X t = (X t i,Xt2) the log-intensities for both samples, 

• Q E {1, ...P} the annotation category, 

• Zf G {1, ...K} the unknown status. 

In our case, K = 4, Groups 1 and 2 will refer respectively to 'noise' and 'identical' 
probes, whereas Groups 3 and 4 will refer to differentially hybridized probes. To 
account for the dependence between adjacent probes, we assume that the process 
{Z t } is a first order Markov chain with heterogenous transition it p depending on 
the annotation category: 

P(Z t = l\Z t ^ = k,Q = p) = nP 

The proportion of observations in each group for each annotation category is given 
by the stationary distribution mP of the corresponding transition matrix % p . We 
then assume that the {X t } are independent conditionally to the {Z t } with distribu- 
tion 

The parameters \l k and Z k of the Gaussian do not depend on the annotation cate- 
gory. 

If there is no spatial dependence, the {Z t } are independent and distributed 
according to a multinomial of parameter nl where 7if corresponds to the propor- 
tion of probes from group k in annotation category p. If there is no annotation 
and no spatial dependence, the model comes down to a mixture model with four 
components. All these sub-models are discussed in Section[3j 
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2.2 Inference 



We use the parametrization proposed by Banfield and Raftery (1993]) which en- 
ables us to characterize geometric properties of the Gaussian density (volume, 
shape, orientation). This parametrization considers the eigenvalue decomposition 
of the variance matrix of the group k: 

L k = D k A k D k . 

The matrix describes both the volume and the shape of the ellipse associated 
with the Gaussian distribution. The matrix D k decribes the orientation of this 
ellipse. A similar decomposition of variance matrix is studied by |Celeux and 



Govaert| ( 1995 ) in the Gaussian mixture context and is implemented in the Mix- 
mod software (Biernacki et ah, 2007) and in the Mclust R package (Fraley and 
Raftery[ 2006| ). In their approach, each term of the decomposition is either equal 
in all groups or specific to each group. 

In our case we need an intermediate modeling. By definition groups 1 and 2 
should have the same orientation (see Figure [T]), which implies that D\ = D 2 . 
Furthermore the dispersion around the main axis is expected to be similar in all 
groups, which amounts to fixing the second eigenvalue of L k for all groups. This 
can be summarized as 



At 



D k A k D' k , 
u 2 



for fc = 1,...,4; 



with u\ k > M2,for k= 1, ...,4. 



The parameters {it p }, {jJ-k}, {D k }, {u\ k } and u 2 are estimated using the EM 



algorithm. The E-step is achieved with the Forward-Backward algorithm (Baum 



( 1972), Rabiner ( 1989 )). This model requires a specific M-step to satisfy the pre- 



scribed constraints on the variance matrices (see Appendix for formulas). These 



constraints cannot be satisfied with Mixmod or Mclust. In Johannes (2010) the 



constraints are related to the means which assumes a strong symmetry in the dis- 
tribution of the data. 



2.3 Posterior probabilities for a region 

The posterior probability for each probe to belong to each group 
x tKX = P(Zt = k\X), where X = {X t }, 

1 



is obtained as a by-product of the Forward-Backward algorithm and can be used 
for probe classification. 

As explained above, the probe may not be the relevant biological entity and we 
would rather look at the status of a region like a gene or a transposable element. 
We define a region as a set of probes, that can be decomposed into sub-regions of 
adjacent probes. As a reference to the gene structure and without loss of general- 
ity, we will refer to these sub-regions as 'exons' and to the spaces between them 
as 'introns'. In eukaryotic genes, exons correspond to coding regions that are 
spliced together in the transcript to become the mRNA, after removal of introns, 
which are not expressed. We define the posterior probability for such a region g 
to belong to group k as the probability for all its probes to belong to group k: 

Q gk , x = P(Vteg,z t = k\x,c) (l) 

A region is covered by several probes and our definition considers the case of a 
homogeneous region, that is when all probes have the same status. 
We compute this probability for a gene g with Q exons (and Q — 1 introns). We 
denote e q the position of the first probe of exon q and i q the position of the first 
probe of intron q; thus i q — 1 refers to the last probe of exon q — 1. As convention, 
we denote iq the position of the first probe after the end of the gene. We also 
denote X% = {X t } u < t < v . We get 



Q gk ,x = P(Vteg,Z t = k\X,C) 



h-l 



P{Z ei =k\X{')x Yl A k 




i Q -2 



xB k,Q x t | A u x P(Z,- G _i = k\Z iQ „ 2 = k,Xf 



Q 



where A ks = P(Z S = k\Z s -\ = k,X s ,C), 

B k , q = P(Z eq =k\Z lq _^=kX q \,C), 

where C = {Q}. All these terms can be calculated with the Forward recursion of 
the Forward/Backward algorithm. 

Note that the sum of the Q g k,x for k 6 {1,...4} is not equal to one, as all 
probes from a same gene may not have the same status. Changing the list of 
exons associated to a gene allows us to account for alternative splicing or exclude 
the last exon for which the expression level could be lower due to the labeling 



protocol (Nicolas et al, 2009) 
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3 Applications 



We now illustrate the use of the proposed modeling on both ChlP-chip and tran- 
scriptomic data. All experiments have been carried out on a two-color Nimble- 
Gen array of about 700 000 probes designed to insure a constant hybridization 
temperature. For each dataset two biological replicates are available, for which 
hybridizations are performed in dye-swap. The normalization step is done by av- 
eraging on the dye- swap the two signals of each technical replicate to remove 
the gene-specific dye bias (Mar y-Huard, Picard and Robin[ 2006[ ). Analyses are 
performed per chromosome on the normalized data. 

3.1 ChlP-chip dataset 

We analyse the data from a histone modification (H3K9me2) study in Arabidopsis 
thaliana for a wildtype and a mutant (polIV). We directly compare the ChIP sam- 
ples of the wildtype and the mutant to study their difference in methylation. The 
methylation mainly affects transposable elements but also large adjacent regions 
( Humburg et al.\ 2008| ). Therefore we expect to find enriched probes both in the 



transposable elements and in wide neighboring regions. As the methylation does 
not affect a specific annotation category, the standard annotation information is 
not useful to detect enriched probes. This suggests using a HMM model without 
the annotation knowledge. 

The histone methylation under study is known to be weakly present in the genome 
and the mutant is known to have a loss of methylation compared to the wildtype 



(Bernatavichute et al. 2008). We find consistent results as shown by the esti- 
mated proportions in each group given by our model: 43% noise, 21% identical, 
22% lost in mutant, 14% gain in mutant. 

The studied histone modification is also known as a heterochromatin mark. Most 
regions covered by H3K9me2 are adjacent and cover several megabases in peri- 
centromeric regions or in interstitial heterochromatin regions (tightly packed form 
of chromatin) as the knob of chromosome 4, but there are also smaller regions 
(islands of heterochromatin) located in euchromatin (lightly packed form of chro- 
matin) and covering mainly transposable elements (Bernatavichute et al. 2008). 



The results obtained using our method corroborate this information: 91.3% of 
probes in heterochromatin are methylated whereas only 49.5% of probes in eu- 
chromatin are methylated. In heterochromatin, 82% of probes have identical be- 
haviour between wildtype and mutant whereas only 9.5% of probes are identical 
in euchromatin. Moreover 56% of methylated probes cover transposable elements 
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or a 500 base-pair (bp) surrounding region. The transition probabilities provide 
insights about the length of regions from each group through mean sojourn time. 
The average size of the binding sites is 14.3 probes (corresponding to 3289bps) 
for the identical group, 4.5 probes (1035bps) for the group with lost in mutant, 3.7 
probes (851bps) for the group enriched in mutant and 7.7 probes (1771bps) for 
the noise group. 

These calculations show that impoverished or enriched regions are three times 
smaller than regions with identical behaviour between wildtype and mutant. More- 
over, the transposable elements are 2 to 3 times smaller in the euchromatin com- 
pared to the heterochromatin. This suggests that most of the methylation losses of 
the mutant occur in transposable elements from the euchromatin. 
The transposable element META1 (located between positions 5326458 and 533 1580 
on the chromosome 4) is known to have a loss of methylation in the mutant. The 
regulatory region of META1 is located at the beginning of the transposable ele- 
ment with small RNAs which are involved in methylation process. Our method 
declared the first half of the probes covering META1 (near the start position) in 
the group where methylation is lost. The other probes are declared identically 
methyled between the two samples. This example shows the advantage of the 
high resolution of the tiling array. 



Comparison with the models of Johannes (2010) As in Section |2Tj Groups 1 
and 2 refer respectively to 'noise' and 'identical' probes, whereas Groups 3 and 



4 correspond to differentially enriched probes. Johannes ( 2010[ ) proposed two 
models. A full- switching model (Model 2) where the component means are con- 
strained as follows: ft = (jUi,Mi), £ 2 = (jU 2 ,jU 2 ), M 3 = (>2,Mi)> JU = G"i,M2) 
and the covariances matrices of Groups 3 and 4 are equal (E3 = E4). Model 3 is a 
flexible- switching model similar to the full- switching model (Model 2) with less 
constraints: /I3 = (/l4,/l3) and /I4 = (113,114). We compared our model with their 
models on the H3K9me2 dataset. Model 2 leads to a smaller proportion of differ- 
entially enriched regions (7.8% lost in mutant and 1 .2% gain in mutant, see Figure 
[3]) than our model (respectively 22% and 14%). The transposable element META1 
that is declared differentially enriched with our model (see above) is found in the 
identical group according to their Model 2. The classification of Model 3 seems 
not to be suitable for probes with similar intensities between 8 and 10 where more 
probes are expected to be declared in the identical group (see Figure [3]). In con- 
clusion, it seems that the independence assumption, the symmetrical constraints 
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on the means and, the equal variances for the differentially enriched probes lead 
to a too simple model for analyzing such data. These two models do not also fit 
well the transcriptional dataset defined in Section 3.2 (results not shown). 




Figure 3: Classification comparison between our HMM model and the models 2 
and 3 of |Johannes| ( |20T0l ) on H3K9me2 dataset. Left: Model 2, Middle: Model 3, 
Right: our HMM model. 



3.2 Transcriptional dataset 

We now study the gene differential expression between the leaf and the seed 10 
days after pollination of the plant Arabidopsis thaliana. First we compare the 4 
sub-models, second we present the results by gene, then we consider the detection 
of new transcribed regions. In the results, over-expressed (under-expressed) refers 
to probes with higher (smaller) signal in the seed. 
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Table 1: Fit of the 4 models. 









^3 


^#4 


number of parameters 


18 


30 


24 


60 


—2 log-likelihood 


406248 


371308 


373282 


356616 


BIC 


406457 


371656 


373561 


357312 


ICL 


436185 


416529 


399975 


400330 



3.2.1 Comparison of the 4 models 



The model presented in Section 2.1 uses all available information and is referred 
to as model For the annotation, P = 3 categories are considered: intergenic, 
intron and exon, and only exons are expected to be transcribed. Model ^4 can 
be simplified if either the structural annotation (Model or spatial dependence 
between probes (Model ^#3) is not taken into account. The model is the sim- 
plest model with neither annotation nor spatial dependence. It comes down to 
an independent mixture model. The constraints on the variance matrices detailed 



in Section Z2]are kept in all models. Table [T] presents the fit of the four models 



for chromosome 4. The full model ^#4 achieves the best BIC criterion, suggest- 



ing that all available information should be taken into account. ICL (Biernacki, 



Celeux and Govaertj |2000| ) is an alternative model selection dedicated to classi- 
fication purposes. According to ICL, the model ^3 (without HMM) should be 
chosen, meaning that the annotation information somehow contains the informa- 
tion about the spatial dependence. Nevertheless, since the difference between ^#3 
and ^4 is small in terms of ICL and the number of parameters is far smaller than 
the number of observations, we use the model ^#4 to compare the two transcrip- 
tomic samples. 

As expected, intergenic probes mostly belong to the noise group (84%) and 
few belong to expressed groups: 9% in the under-expressed group and 6% in the 
over-expressed. Intronic probes display a similar, although different, repartition: 
60% noise, 7% identical, 24% under-expressed and 9% over-expressed (cf Section 



3.2.3 for discussion about expressed probes in intergenic and intronic regions). 
As expected, most exonic probes (78%) belong to the expressed groups: 41% 
identical, 23% under-expressed and 14% over-expressed. The transition matrices 
for the intronic and intergenic categories are very similar (not shown): whatever 
the status of probe t, probe t+l has between 70% and 95% chance of being 
noise. This is different for the exonic probes where the transition matrix has high 
probabilities on the diagonal meaning that probe t + l has high probability (80% 
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to 90%) of having the same status as probe t. All these results seem to be coherent 
with what is expected for transcriptomic data. 



3.2.2 Gene classification 

We now consider the classification of each gene. To this end, we compute the 
posterior probability Qgkjt defined in Equation ([T]). We propose to classify the 
genes via a two-step procedure. The probability for a gene to be homogeneous 
whatever the status is Q g k,x- We first verify if the gene has homogeneous status 
by considering a ratio similar to a Bayes factor: £ fe Q g k,x I Hk Qgk, where Q gk = 
P(Vf e g,Z t = k\C) is the non-conditional version of Q g k,x> which is for a gene 



Q gk = m E k x (^^(^M x Y[[{K') e ^ 



lq ]kk , 



where superscripts E and / refer to the exonic and intronic categories, respectively. 
As its computation involves a product of probabilities with as many terms as the 
number of exonic probes in the gene, Q g k,x goes to zero for long genes. The ra- 
tio with Q g k does not correct this effect, therefore we apply an additional linear 
correction on the log-ratio with respect to the length of exons and the number of 
exons in the gene. 

We define this corrected log-ratio as unistatus value which is a tool for decision 
support. 

If the homogeneous assumption seems verified, the second step is to calculate the 
conditional posterior probability Q g k,xfLi Qgi,x to assign the gene to the group k 
for which this posterior probability is the highest. 

We found 8 1 % of genes which have an unistatus value higer than (corresponding 
to 1736 genes). Among these 1736 genes, 920 are declared identically expressed 
in the seed and in the leaf, 318 are declared under-expressed in the seed and 181 
are declared over-expressed in the seed. We focus our analysis on 8 genes clearly 
identified as preferentially transcribed in seeds using graphical outputs of Gen- 



evestigator which is a database of transcriptome analysis results (Zimmermann 



et al. 2004). Among the 8 genes, 7 have an unistatus value higer than and are 



declared over-expressed in seed with our calculation. 
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3.2.3 Detection of new transcripts 



Although our model is built for the comparison of two samples, it also allows 
the detection of previously unknown transcription sites thanks to the high reso- 
lution of the tiling array. To this aim, the model without annotation seems more 
suitable, since we are bringing it into question. A lot of regions with expressed 
probes are found in intergenic regions: 1328 small regions with 2 or 3 consecutive 
expressed probes, 185 regions with 4 or 5 consecutive expressed probes and 90 
regions with more than 5 consecutive probes (including 25 regions with more than 
10 consecutive probes). For the 90 regions with more than 5 consecutive probes, 
we check with other annotation information like Expressed Sequence TAG (EST) 
or genes predicted by the Eugene software ( Schiex et al.\ 200 \) which are not yet 
in the official TAIR annotation. We found 39 regions matching with annotation 
like small RNA, rRNA, tRNA, including 12 regions corresponding to a coding 
sequence defined in Eugene and 10 corresponding to transcriptional units recently 
annotated due to the presence of EST. The Figure [4] (from FLAGdb++ ( |Samson 



et al.\ 2004)) shows examples of results for two annotated genes and also for two 



expressed regions which correspond to EST and Eugene genes. Moreover the 
obtained results show many other interesting things, such as surprisingly many 
transcriptions in the introns in 5'UTR (40% of intronic probes declared expressed 



in Section [3.2.1[ ). This seems to be consistent with a recent article of |Cenik et al. 
(2010) assuming a functional role of 5'UTR short introns. 



4 Discussion 

Tiling array is a powerful technology which requires adapted statistical methods 
to deal with the large quantity and the variability of the data. We focus on the 
comparison of two samples from transcriptomic or ChlP-chip experiments and 
also on the detection of transcribed regions by directly modeling the joint distri- 
bution of the two sample intensities. We consider all the available information 
from the probes: the intensity of the two signals, the dependence between neigh- 
boring probes and the structural annotation. The annotation knowledge is very 
useful information with the aim of classification because of the intrinsic differ- 
ence between exonic or intergenic probes. This method can be used for ChlP-chip 
or transcriptomic data whenever there are two conditions to compare. Both one- 
color and two-color tiling arrays can be analysed. Applications on Arabidopsis 
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Figure 4: Presentation of results in Flagdb++. Circles represent EST, white arrows 
are official TAIR annotation genes and black arrows represent Eugene genes. The 
small rectangles are probes colored according to their status: light if not expressed, 
dark if expressed. On the first 2 lines, there are 2 expressed genes covered by 
a majority of probes with the same status except the intronic probes which are 
reasonably declared as not expressed. The last 2 lines present expressed probes 
where there are no official genes but the signal coincides with EST and/or Eugene 
genes. 

thaliana tiling array show the ability of the model to interpret the data and pro- 
vide a new insight of gene expression or gene expression control as well as new 
biological hypothesis. 

Arabidopsis thaliana is a model plant with a very well-known genome annotation 
but the annotation is not available or unreliable for many organisms. That is why 
the sub-models are also useful. The model without annotation allows us not to be 
limited by the quality of the available annotation and this model is also useful to 
detect new genes to improve the current official annotation. 
This work raises also the question of classification. The results are given by probe 
and by region. We compute a posterior probability by region and we propose 
a procedure of region classification. The most common regions are the genes 
which are non-connected regions, but any other region can be defined. It would 
be interesting to control the False Discovery Rate, i.e. the expected proportion of 
misclassifications, in the case of having 4 groups and under dependence hypothe- 
sis, and also for the results given by region. 



15 



Acknowledgements 



The authors thank F. Roudier and V. Colot from IB ENS for providing ChlP-chip 
data and for helpful discussions to biological interpretation, and S. Derozier and 
A. Lecharny from URGV for their help in bioinformatic analysis. The authors 
are grateful to S. Balzergue from URGV for providing transcriptomic tiling array 
data. This work was funded by MIA, GAP and MICA departments of INRA. 



A Appendix: Computation of the estimates of D and 
A. 

Recall X t = (X t i,X t2 ) the log-intensities for both samples, t varies from 1 to n, 
where n is the total number of observations. Let X k = , where n k = 

n 
t=l 

Let W k = j^tk,x( X t-Xk)(Xt-Xk)' be a matrix like ^ ^ and A ^ = 

u\k 0\ 
u 2 ) ' 
with u\ k > u 2 ,for k = 1, ...,4. 

The maximum likelihood estimator of the orientation matrix D identical for the 

d -J~\ -d 2 ^ 

first two components with the same orientation is in the form of 



d 



where d is the minimum of the function: 



2 



f ^ = y i d 2 w lk + 2w 2k dVl -d 2 + w 4k (l - d 2 ) + d 2 w 4k + 2w 2k dVl -d 2 + w lk (l - d 2 ) 
j~i \ u ik u 2 

The estimator of J is defined by: 

d 2 - \ = ±— — withJ>0, 



2 



{N h4 } +4{N 2 y 



2 2 

whereM ;4 = {wi k -w 4k )(u 2 -u\ k )/u\ k u 2 and N 2 = £ {w 2k )(u 2 -u\ k )/u\ k u 2 . 

k=l k=l 
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( 



hk b 2 k 



) 



. The maximum 



) 



, where 



u\k =bik/n k 
U2 =Li=l b 2k/n. 
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